August 28 + September 4, 2024
Important
Before next Wednesday, read: Tufte. 1997. Visual and Statistical Thinking: Displays of Evidence for Making Decisions. (Use Google to find it.)
What was Hilary trying to answer in her data collection?
Name two of Hilary’s main hurdles in gathering accurate data.
Which is better: high touch (manual) or low touch (automatic) data collection? Why?
What additional covariates are needed / desired? Any problems with them?
How much data does she need?
Yau (2013) gives us nine visual cues, and Wickham (2014) translates them into a language using ggplot2.
Visual Cues: the aspects of the figure where we should focus.
Position (numerical) where in relation to other things?
Length (numerical) how big (in one dimension)?
Angle (numerical) how wide? parallel to something else?
Direction (numerical) at what slope? In a time series, going up or down?
Shape (categorical) belonging to what group?
Area (numerical) how big (in two dimensions)? Beware of improper scaling!
Volume (numerical) how big (in three dimensions)? Beware of improper scaling!
Shade (either) to what extent? how severely?
Color (either) to what extent? how severely? Beware of red/green color blindness.
Coordinate System: rectangular, polar, geographic, etc.
Scale: numeric (linear? logarithmic?), categorical (ordered?), time
Context: in comparison to what (think back to ideas from Tufte)
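As a rough illustration of how these cues become ggplot2 code, the sketch below maps position, shape, and color to variables and puts the y axis on a logarithmic scale. The data frame `births6` is typed in by hand from the first six rows of the births table shown later in these notes; its name and the choice of which variable drives which cue are assumptions made for illustration only.

```r
library(ggplot2)

# Six rows copied by hand from the births table later in these notes
# (an illustrative stand-in for the full data set).
births6 <- data.frame(
  date   = as.Date("1978-01-01") + 0:5,
  births = c(7701, 7527, 8825, 8859, 9043, 9208),
  wday   = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri")
)

cues <- ggplot(births6, aes(x = date, y = births)) +  # position (numerical)
  geom_point(aes(shape = wday,                        # shape (categorical)
                 color = births),                     # color (numerical here)
             size = 3) +
  scale_y_log10()                                     # scale: logarithmic

cues
```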
Visual Cues of Yau (2013):
Position (numerical)
Length (numerical)
Angle (numerical)
Direction (numerical)
Shape (categorical)
Area (numerical)
Volume (numerical)
Shade (either)
Color (either)
Attributes can focus your reader’s attention.
ggplot2
What I will try to do
give a tour of ggplot2
explain how to think about plots the ggplot2 way
prepare/encourage you to learn more later
What I can’t do in one session
show every bell and whistle
make you an expert at using ggplot2
One of the best ways to get started with ggplot is to Google what you want to do along with the word ggplot, then look through the images that come up. More often than not, the associated code is there. There are also ggplot galleries of images; one of them is here: https://plot.ly/ggplot2/
Look at the end of this presentation and at the syllabus for more help options.
ggplot
geom: the geometric “shape” used to display data
aesthetic: an attribute controlling how geom is displayed with respect to variables
guide: helps user convert visual data back into raw data (legends, axes)
stat: a transformation applied to data before geom gets it
| date | births | wday | year | month | day_of_year | day_of_month | day_of_week |
|---|---|---|---|---|---|---|---|
| 1978-01-01 | 7701 | Sun | 1978 | 1 | 1 | 1 | 1 |
| 1978-01-02 | 7527 | Mon | 1978 | 1 | 2 | 2 | 2 |
| 1978-01-03 | 8825 | Tue | 1978 | 1 | 3 | 3 | 3 |
| 1978-01-04 | 8859 | Wed | 1978 | 1 | 4 | 4 | 4 |
| 1978-01-05 | 9043 | Thu | 1978 | 1 | 5 | 5 | 5 |
| 1978-01-06 | 9208 | Fri | 1978 | 1 | 6 | 6 | 6 |
Two Questions:
What do we want R to do? (What is the goal?)
What does R need to know?
Goal: scatterplot = a plot with points
What does R need to know?
data source: Births78
aesthetics:
date -> x
births -> y
What has changed?
Now there are two layers: one with points and one with lines
The layers are placed one on top of the other: the points are below and the lines are above.
data and aes specified in ggplot() affect all geoms
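The scatterplot and its two-layer extension described above can be sketched as follows. To keep the example self-contained, `births6` is a hand-typed copy of the six rows shown in the table (an assumption made for illustration; the notes themselves use the full Births78 data).

```r
library(ggplot2)

# Six rows copied by hand from the table above.
births6 <- data.frame(
  date   = as.Date("1978-01-01") + 0:5,
  births = c(7701, 7527, 8825, 8859, 9043, 9208)
)

# Goal: scatterplot. R needs the data source and the aesthetics.
p_points <- ggplot(data = births6, aes(x = date, y = births)) +
  geom_point()

# Adding a second geom adds a second layer, drawn on top of the first.
# Both layers inherit data and aes from ggplot().
p_both <- p_points + geom_line()

p_both
```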
This is mapping the color aesthetic to a new variable with only one value (“navy”).
So all the dots get set to the same color, but it’s not navy.
If we want to set the color to be navy for all of the dots, we do it outside the aes() designation:
color = "navy" is now outside of the aesthetics list. That’s how ggplot2 distinguishes between mapping and setting.
ggplot() establishes the default data and aesthetics for the geoms, but each geom may change these defaults.
good practice: put into ggplot() the things that affect all (or most) of the layers; put the rest in geom_XXXX()
Information gets passed to the plot via:
map the variable information inside the aes (aesthetic) command
set the non-variable information outside the aes (aesthetic) command
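A minimal sketch of the difference between mapping and setting, again using a hand-typed six-row stand-in (`births6`, an illustrative name) for the births data. `layer_data()` is used only to inspect which color each version actually ends up with.

```r
library(ggplot2)

births6 <- data.frame(
  date   = as.Date("1978-01-01") + 0:5,
  births = c(7701, 7527, 8825, 8859, 9043, 9208)
)

# Mapping: "navy" inside aes() is treated as a one-level variable,
# so ggplot2 picks a color from its default palette (and adds a legend).
mapped <- ggplot(births6, aes(x = date, y = births)) +
  geom_point(aes(color = "navy"))

# Setting: "navy" outside aes() is used literally for every point.
set_navy <- ggplot(births6, aes(x = date, y = births)) +
  geom_point(color = "navy")

unique(layer_data(mapped)$colour)    # a palette color, not "navy"
unique(layer_data(set_navy)$colour)  # "navy"
```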
[1] "geom_abline" "geom_area"
[3] "geom_ash" "geom_bar"
[5] "geom_bin_2d" "geom_bin2d"
[7] "geom_blank" "geom_boxplot"
[9] "geom_bracket" "geom_col"
[11] "geom_contour" "geom_contour_filled"
[13] "geom_count" "geom_crossbar"
[15] "geom_curve" "geom_density"
[17] "geom_density_2d" "geom_density_2d_filled"
[19] "geom_density_line" "geom_density_ridges"
[21] "geom_density_ridges_gradient" "geom_density_ridges2"
[23] "geom_density2d" "geom_density2d_filled"
[25] "geom_dotplot" "geom_errorbar"
[27] "geom_errorbarh" "geom_exec"
[29] "geom_freqpoly" "geom_function"
[31] "geom_hex" "geom_histogram"
[33] "geom_hline" "geom_jitter"
[35] "geom_label" "geom_label_repel"
[37] "geom_line" "geom_linerange"
[39] "geom_lm" "geom_map"
[41] "geom_mosaic" "geom_mosaic_jitter"
[43] "geom_mosaic_text" "geom_path"
[45] "geom_pictogram" "geom_point"
[47] "geom_pointrange" "geom_polygon"
[49] "geom_pwc" "geom_qq"
[51] "geom_qq_line" "geom_quantile"
[53] "geom_rangeframe" "geom_raster"
[55] "geom_rect" "geom_ribbon"
[57] "geom_ridgeline" "geom_ridgeline_gradient"
[59] "geom_rug" "geom_segment"
[61] "geom_sf" "geom_sf_label"
[63] "geom_sf_text" "geom_signif"
[65] "geom_smooth" "geom_spline"
[67] "geom_spoke" "geom_step"
[69] "geom_stripped_cols" "geom_stripped_rows"
[71] "geom_text" "geom_text_repel"
[73] "geom_tile" "geom_tufteboxplot"
[75] "geom_violin" "geom_vline"
[77] "geom_vridgeline" "geom_waffle"
help pages will tell you their aesthetics, default stats, etc.
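A listing like the one above can be produced with `apropos()`, which searches the attached packages for matching names. With only ggplot2 attached the result is shorter than the list shown, which presumably reflects several extension packages being loaded as well.

```r
library(ggplot2)

# All objects on the search path whose names start with "geom_"
geoms <- apropos("^geom_")
head(geoms)

# In an interactive session, the help page for any geom lists its
# aesthetics and default stat, for example:
# ?geom_area
```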
Most (all?) graphics are intended to help us make comparisons
Key plot metric
Does my plot make the comparisons I am interested in:
HELPrct: Health Evaluation and Linkage to Primary care randomized clinical trial. Subjects admitted for treatment for addiction to one of three substances.
| age | anysubstatus | anysub | cesd | d1 | daysanysub | dayslink | drugrisk | e2b | female | sex | g1b | homeless | i1 | i2 | id | indtot | linkstatus | link | mcs | pcs | pss_fr | racegrp | satreat | sexrisk | substance | treat | avg_drinks | max_drinks | hospitalizations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37 | 1 | yes | 49 | 3 | 177 | 225 | 0 | NA | 0 | male | yes | housed | 13 | 26 | 1 | 39 | 1 | yes | 25.11 | 58.4 | 0 | black | no | 4 | cocaine | yes | 13 | 26 | 3 |
| 37 | 1 | yes | 30 | 22 | 2 | NA | 0 | NA | 0 | male | yes | homeless | 56 | 62 | 2 | 43 | NA | NA | 26.67 | 36.0 | 1 | white | no | 7 | alcohol | yes | 56 | 62 | 22 |
| 26 | 1 | yes | 39 | 0 | 3 | 365 | 20 | NA | 0 | male | no | housed | 0 | 0 | 3 | 41 | 0 | no | 6.76 | 74.8 | 13 | black | no | 2 | heroin | no | 0 | 0 | 0 |
| 39 | 1 | yes | 15 | 2 | 189 | 343 | 0 | 1 | 1 | female | no | housed | 5 | 5 | 4 | 28 | 0 | no | 43.97 | 61.9 | 11 | white | yes | 4 | heroin | no | 5 | 5 | 2 |
| 32 | 1 | yes | 39 | 12 | 2 | 57 | 0 | 1 | 0 | male | no | homeless | 10 | 13 | 5 | 38 | 1 | yes | 21.68 | 37.3 | 10 | black | no | 6 | cocaine | no | 10 | 13 | 12 |
| 47 | 1 | yes | 6 | 1 | 31 | 365 | 0 | NA | 1 | female | no | housed | 4 | 4 | 6 | 29 | 0 | no | 55.51 | 46.5 | 5 | black | no | 5 | cocaine | yes | 4 | 4 | 1 |
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Every geom comes with a default stat.
For many geoms (for example geom_point()) the default is stat_identity(), which leaves the data unchanged.
Every stat comes with a default geom, and every geom with a default stat.
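The geom/stat pairing can be seen with a histogram, sketched here on the six ages from the HELPrct rows shown above (typed in by hand; the data frame name `ages` is illustrative). Supplying `binwidth` also avoids the `bins = 30` message. The bin width of 5 is an arbitrary choice for this tiny sample.

```r
library(ggplot2)

ages <- data.frame(age = c(37, 37, 26, 39, 32, 47))

# geom_histogram() uses stat_bin() by default: the stat counts cases per bin
# and the geom draws one bar per bin.
hist_plot <- ggplot(ages, aes(x = age)) +
  geom_histogram(binwidth = 5)

# The same plot written from the stat side: stat_bin() draws bars by default.
hist_via_stat <- ggplot(ages, aes(x = age)) +
  stat_bin(binwidth = 5)

# Inspect what the stat handed to the geom.
binned <- layer_data(hist_plot)
binned[, c("xmin", "xmax", "count")]
```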
Using color and linetype:
Boxplots use stat_boxplot(), which computes the five number summary.
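By default the quantitative variable is y, with an additional x variable for the groups. (Since ggplot2 3.3.0 the orientation can also be inferred from which aesthetic is continuous.)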
Horizontal boxplots are obtained by flipping the coordinate system:
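A minimal sketch, using six hand-typed rows from the HELPrct table above (`help6` is an illustrative name, and six cases are far too few for a meaningful boxplot):

```r
library(ggplot2)

help6 <- data.frame(
  age       = c(37, 37, 26, 39, 32, 47),
  substance = c("cocaine", "alcohol", "heroin", "heroin", "cocaine", "cocaine")
)

# Vertical boxplots: categorical x, quantitative y.
box <- ggplot(help6, aes(x = substance, y = age)) +
  geom_boxplot()

# Same plot with the roles of the axes swapped on the page.
box_flipped <- box + coord_flip()

# The five number summary computed by the stat, one row per substance.
layer_data(box)[, c("ymin", "lower", "middle", "upper", "ymax")]
```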
coord_flip() may be used with other plots as well to reverse the roles of x and y on the plot.
We can scale the continuous axis
We’ve triggered a new feature: dodge (for dodging things left/right). We can control how much if we set the dodge manually.
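Dodging appears as soon as a second grouping variable is mapped, and its amount can be set through `position_dodge()`. The sketch below uses six hand-typed HELPrct rows (`help6`, an illustrative name). The width of 1 is just one possible choice.

```r
library(ggplot2)

help6 <- data.frame(
  age       = c(37, 37, 26, 39, 32, 47),
  substance = c("cocaine", "alcohol", "heroin", "heroin", "cocaine", "cocaine"),
  sex       = c("male", "male", "male", "female", "male", "female")
)

# Mapping color to sex splits each substance into side-by-side boxes;
# position_dodge(width = ...) controls how far apart they sit.
dodged <- ggplot(help6, aes(x = substance, y = age, color = sex)) +
  geom_boxplot(position = position_dodge(width = 1))

dodged
```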
[1] 10000 76
One way to deal with overplotting is to set the opacity low.
Alternatively (or simultaneously) we might prefer a different geom altogether.
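Both remedies can be sketched on simulated data. The 10,000 points below are random normal draws made up for illustration, not the data set used in these notes, and the alpha value and bin count are arbitrary starting points.

```r
library(ggplot2)

set.seed(1)
n <- 10000
sim <- data.frame(x = rnorm(n), y = rnorm(n))  # simulated, for illustration

# Option 1: low opacity, so dense regions look darker than sparse ones.
faint <- ggplot(sim, aes(x = x, y = y)) +
  geom_point(alpha = 0.05)

# Option 2: a different geom that shows counts per rectangular bin.
binned2d <- ggplot(sim, aes(x = x, y = y)) +
  geom_bin_2d(bins = 40)

faint
binned2d
```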
coords (coord_flip() is good to know about)
themes (for customizing appearance)
position (position_dodge(), position_jitterdodge(), position_stack(), etc.)
transforming axes
jitterdodge()
ggplot(data = HELP_data, aes(x = substance, y = age, color = children)) +
  geom_boxplot(coef = 10, position = position_dodge(width = 1)) +
  geom_point(aes(fill = children), alpha = 0.5,
             position = position_jitterdodge(dodge.width = 1, jitter.width = 0.2)) +
  facet_wrap(~homeless) +
  ggtitle("HELP clinical trial at detoxification unit")
R for Data Science by Hadley Wickham and Garrett Grolemund
shiny
interactive graphics / modeling
https://shiny.rstudio.com/
plotly
Plotly is an R package for creating interactive web-based graphs via plotly’s JavaScript graphing library, plotly.js. The plotly R library contains the ggplotly function, which will convert ggplot2 figures into a Plotly object. Furthermore, you have the option of manipulating the Plotly object with the style function.
gganimate
Tufte (1997) Visual and Statistical Thinking: Displays of Evidence for Making Decisions. (Use Google to find it.)
Make the data stand out
Facilitate comparison
Add information
Nolan & Perrett, 2016
Tufte lists two main motivational steps to working with graphics as part of an argument.
“An essential analytic task in making decisions based on evidence is to understand how things work.”
“Making decisions based on evidence requires the appropriate display of that evidence.”
How many aspects of this graph can you point out which are relevant to figuring out that cholera infection was coming from a single pump? Are there any distracting aspects?
Why would the outbreak already have begun to decline before the pump handle was removed?
One of the graphics which was particularly unconvincing in trying to explain that O-rings fail in the cold.
A different graph of the Challenger information, now sorted by temperature
The graphic the engineers should have led with in trying to persuade the administrators not to launch. It is evident that the number of O-ring failures is quite highly associated with the ambient temperature. Note the vital information on the x-axis associated with the large number of launches at warm temperatures that had zero O-ring failures.
image credit: https://www.darkhorseanalytics.com/portfolio-data-looks-better-naked
One in 5,000, NYT, D. Leonhardt 9/7/21; image credit: https://www.nytimes.com/2021/09/07/briefing/risk-breakthrough-infections-delta.html
W.E.B. Du Bois (1868-1963)
In 1900 Du Bois contributed approximately 60 data visualizations to an exhibit at the Exposition Universelle in Paris, an exhibit designed to illustrate the progress made by African Americans since the end of slavery (only 37 years prior, in 1863).
https://drawingmatter.org/w-e-b-du-bois-visionary-infographics/